CS 267 Homework 1: Matrix Multiply
Abstract
1.1. Method 1: Copy-transpose DGEMM.

Strided accesses pose a critical bottleneck in the naive implementation of matrix multiply. Both A and B are stored in the same order (column-major), but the multiply operation requires that entries of either A or B be loaded with stride M. (Without loss of generality, assume the A matrix.) Large strides make ineffective use of cache lines, since (for sufficiently large M) each consecutive entry in a row of A wastes an entire cache line. This effectively reduces the total cache size and exacerbates the effects of aliasing (especially when M is not co-prime with the number of cache lines).

One possibility is to transform the A matrix into row-major order by making a transposed copy of A in a pre-allocated buffer. Transposing A allows matrix multiply using unidirectional unit-stride loads of both A and B, with the multiply arranged as a sequence of dot products. This takes full advantage of spatial locality at all cache levels, permits optimizations such as prefetching, and uses cache space optimally. However, transposing the matrix doubles the total number of memory accesses: an entire matrix must be loaded and stored again. Furthermore, if the transpose itself is done “naively,” that is, if the elements of A are simply read in row order and copied, the same problem of strided memory accesses returns. In fact, the problem is then worse, because the loads and stores cannot be interleaved with floating-point operations to hide latency. The effort of developing a “clever” transpose scheme to increase locality might be better spent modifying the standard matrix multiply algorithm to improve locality without copy operations.

Despite these concerns, the method has potential. As a first approximation, suppose that the transpose is limited only by memory bandwidth (6.4 GB/s on the Itanium 2 platforms tested) and that the dot-product matrix multiply runs at peak (5.2 Gigaflop/s). The latter is a fair assumption, given the success of tuned non-copying DGEMM kernels at reaching high fractions of peak. Each entry C(i, j) of C requires M adds and M multiplies, so the full multiply needs exactly 2M³ floating-point operations. For example, if M = 1024, the matrix multiply alone uses 2M³ ≈ 2.1×10⁹ flops, which at peak takes about 0.41 seconds; the transpose calls for 2M² ≈ 2.1×10⁶ memory operations (one load and one store per entry, about 16 MB of traffic), which take 2.6×10⁻³ seconds, that is, less than one percent of the cost of the matrix multiplication. In fact, standard (non-transposing) DGEMM must run at over 99% of peak to attain the same execution time.
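As an illustration, a minimal C sketch of the scheme is given below: a naive transpose of A into a scratch buffer, followed by a dot-product multiply whose inner loop reads both operands with unit stride. The function name transpose_dgemm, the per-call allocation, and the square column-major interface with leading dimension M are assumptions made for this example; it is not the report's tuned kernel.

#include <stdlib.h>

/* C := C + A*B for M-by-M column-major matrices (leading dimension M). */
static void transpose_dgemm(int M, const double *A, const double *B, double *C)
{
    /* Scratch buffer for the transpose of A; a real kernel would
       pre-allocate this once and reuse it across calls. */
    double *At = malloc((size_t)M * M * sizeof(double));
    if (At == NULL)
        return;

    /* Naive transpose: read A in row order (stride-M loads) and write
       At contiguously, so At[i*M + k] = A(i,k). These strided loads
       are exactly the 2M^2 memory operations counted in the estimate. */
    for (int i = 0; i < M; ++i)
        for (int k = 0; k < M; ++k)
            At[i * M + k] = A[i + k * M];

    /* Each C(i,j) is a dot product of row i of A (now contiguous in At)
       with column j of B (contiguous in B): unit-stride loads on both. */
    for (int j = 0; j < M; ++j)
        for (int i = 0; i < M; ++i) {
            double cij = C[i + j * M];
            for (int k = 0; k < M; ++k)
                cij += At[i * M + k] * B[k + j * M];
            C[i + j * M] = cij;
        }

    free(At);
}

With this arrangement, all strided traffic is confined to the O(M²) transpose pass, while the O(M³) inner loop touches memory only at unit stride, which is what makes the bandwidth-versus-flops estimate above plausible.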